Exploration of Different Imputation Methods for Missing Data
Author
Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs
Published
July 27, 2024
Introduction
Missing data is a key challenge that data professionals encounter in their daily work. In the real world, data is often messy, incomplete, and requires a level of preprocessing before data analysis can occur. Part of data preprocessing is dealing with missing data. Missing data can occur for several reasons including but not limited to processing error, machine or measurement errors, non-responses in surveys, invalid calculations (e.g. division by zero), and study participant dropout (Emmanuel 2021). Missing data can be problematic for analysis because it can reduce the number of usable observations, introduce bias, and interfere with different analysis tools such as machine learning algorithms, which cannot automatically handle missing data.
One of the first steps in handling missing data is to determine how much of the data is missing and from which columns. Missing data in a dataset can be represented in several ways including NA, NaN, Null, None, blanks, spaces, empty strings, or placeholder values such as 999999 (or equivalent obvious numerical outlier). For standard missing values such as NAs, functions such as is.na() in R can be used to count the number of missing observations. For string-related missing values such as space (“ ” or empty strings (“”), functions such as unique() in R can be used to identify unexpected strings. For numerical placeholders such as 0, 99999 (or equivalent), or negative values, histograms of the data can be useful to identify missing values.
If a dataset contains missing data, there are two basic approaches: deletion or imputation (Schefer 2002). Deletion of data can be problematic for analysis, especially when the proportion of missing data is high or the overall number of observations is low. Deletion of data can also lead to bias or underestimation of variance if the data are not missing completely at random. Imputation is the process of replacing missing values with estimates, which can be accomplished using statistical or machine learning methods. It is crucial when choosing an approach to missing data that the underlying mechanism(s) for the missing data are understood.
When data are MCAR, the probability of a record missing is independent from observed and unobserved data. In other words, the likelihood of an observation having missing data is the same across all observations (Buuren 2018). The pattern for missing data is unrelated to the data itself. For example, if a continuous monitoring system periodically loses network connectivity or power, and that loss cannot be attributed to other factors (e.g. random), this missingess pattern can be considered MCAR. In the case of MCAR data, the deletion of data reduces the number of observations in the dataset but does not introduce bias. MCAR is the most ideal, but least realistic scenario. The probability function for MCAR data can be described as follows:
Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are not dependent on observed data or missing data.
Missing at Random (MAR):
When data are MAR, the probability of a record missing depends on the observed data, but not the unobserved data (Buuren 2018). If performing a survey with age as a variable, women may be more likely to omit their age in the survey. The probability of an observation having a missing data point for age can be therefore be attributed to another observed variable. In the case of MAR, deletion of entire observations with missing data may or may not introduce bias. MAR is typically the underlying assumption for mosy imputation methods. The probability function for MAR data can be described as follows:
Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are dependent on observed data but not missing data.
Missing Not at Random (MNAR):
When data are MNAR, the probability of a record missing is dependent on the unobserved data. The probability of a missing datapoint is variable, but cannot be accounted for with other known, observed variables. There may be a factor that can predict the probability of a missing value, but it is unknown. MNAR data are the most complicated to analyze. Similarly to MAR, deletion of observations with missing data may or may not introduce bias to the analysis. The probability function for MNAR data can be described as follows:
Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data could be dependent on observed data or missing data.
Figure 1: MCAR, MAR, and MNAR Example, where M is the probability of a missing data point, X is the observed data, and Y is the unobserved data. (Du, Hu, and Zhang 2020)
Understanding the mechanism behind missing data (MCAR, MAR, or NMAR) is important because many imputation methods require the assumption of MCAR or MAR to be held true. Statistical tests such as Little’s MCAR test, Hawkin’s test and non-parametric test can be used to support whether data or MCAR or not. These tests will be discussed further in the methods section. Visualization of the missing data is also helpful for determining the relationships (or lack thereof) within the missing data. For example, correlation matrix plots, scatterplots, and other custom visualizations can be useful tools in identifying patterns in the missing data (Ghazali, Shaadan, and Idrus 2020).
One of the approaches to handling missing data is to delete instances of missing values. This can be accomplished through either listwise deletion or feature selection, for example. In listwise deletion, all rows with missing values are deleted from the dataset. This results in a dataset with a lower number of observations. In feature selection, columns with high proportion of missing data are deleted from the dataset. In this way, the number of observations in the dataset is maintained, but the total number of features is reduced. Deletion methods are advantageous in that they are quick and simple. With a large dataset and relatively low amounts of missing data, reducing the number of observations or features may be okay. However, deletion can introduce bias into the analysis if the data are not missing completely at random, which is often the case (Ren et al. 2023), (Emmanuel 2021).
Single imputation:
In single imputation, missing data values are replaced with a single value, without consideration of the uncertainty around the prediction for the missing value. In other words, a single value is predicted for each missing data point. This can be accomplished using techniques such as mean imputation, median imputation, hot and cold deck imputation and regression-based techniques. There is not a model defined for predicting the values in the column as a whole, so there is information lost around the variance of the prediction. Single imputation methods are simpler than multiple imputation methods, but they tend to underestimate the variance resulting models may have artifically low confidence intervals (Ren et al. 2023), (Emmanuel 2021).
Multiple imputation:
In multiple imputation methods such as MICE, a model is constructed to predict missing values in each column based on observed data in other columns. This model provides a point estimate for the missing values, as well as the uncertainty around the predictions. Multiple iterations are performed to predict the missing values. In each iteration, a missing value is predicted based on the model and its uncertainty. That is, with multiple iterations, you will end up with a distribution of possible missing values. In this way, multiple datasets with multiple possible values for the missing data are generated. Once the iterations are complete, the missing data predictions can consolidated back into a single dataset, where the missing values will be a result of the point estimate and the variance or uncertainty from the model. Multiple imputation can be a powerful technique that is helpful in preserving some of the variance or uncertainty around missing values, but it can be complex and computationally expensive. (Ren et al. 2023), (Emmanuel 2021).
Imputation methods can be further broken down into statistical versus machine learning methods. According to (Lin and Tsai 2020)’s review of literature from 2006-2017, four of the most common statistical methods for missing value imputation include expectation management (EM), linear regression, least squares (LS), and mean/mode imputation. For machine learning, the four most common methods include clustering, decision trees, random forest, and KNN.
Theories for Missing Data
Rubin’s Missing Data Theory:
This foundational theory categorizes missing data into MCAR, MAR, and NMAR. It provides a framework for understanding the mechanisms behind missing data and guides the selection of appropriate imputation strategies based on these mechanisms(Rubin 2004).
Statistical Theory of Imputation:
This theory involves using statistical models to estimate missing values, leveraging the relationship between observed and missing data points. Multiple imputation, a prominent technique within this theory, generates multiple plausible values for each missing data point to account for uncertainty (Schafer 1997).
Machine Learning Theories:
Machine learning techniques are increasingly employed for imputing missing data due to their ability to learn patterns from data and make predictions. Methods such as k-nearest neighbors (KNN) and neural networks are applied to predict missing values based on observed data patterns (Batista and Monard 2003).
These theories provide a comprehensive framework for understanding and implementing imputation methods across various datasets and analytical contexts.
Research Methods for Missing Data:
Descriptive Studies:
These studies analyze the patterns of missing data and evaluate the effectiveness of various imputation methods. Examples include the work by Jerez et al. and Güzel (Jerez 2010), (Guzel 2013).
Comparative Studies:
These studies compare the performance of different imputation methods. Joseph G. Ibrahim and Siming Zheng have conducted such comparative analyses, highlighting the strengths and weaknesses of various approaches (Ibrahim 2011), (Zhang 2024), (Zheng, Zhang, and Zhou 2023).
Simulation Studies:
These studies test imputation methods on controlled datasets, allowing researchers to systematically observe the impact of different techniques. This helps in understanding the conditions under which each method performs best.
Application-Based Studies:
These studies apply imputation methods to real-world data, demonstrating their practical use and the challenges involved. Examples include the works by Saqib Ejaz Awan and Emmanuel, which show how imputation methods are implemented in practical scenarios (Awan 2022), (Emmanuel 2021).
Theoretical Analysis:
Critical decisions based on statistical theories and useful techniques are involved in addressing missing data in datasets. By categorizing missing data into MCAR, MAR, and MNAR, Rubin’s Missing Data Theory offers a fundamental framework for choosing the best imputation techniques. Statistical Theory of Imputation estimates missing values by using models such as multiple imputation, which preserves uncertainty by making many credible guesses. In contrast, machine learning theories use data-driven insights to forecast missing values based on observed patterns using algorithms like KNN and neural networks. Many ways are available with trade-offs in simplicity, accuracy, and processing needs. Examples of these techniques are deletion methods, single imputation (e.g., mean imputation), and multiple imputation (e.g., MICE). These techniques are validated by comparative and simulation studies, which assess their efficacy in various circumstances and datasets. Comprehending these theories and approaches is essential for minimizing biases and optimizing data utility in analytical tasks, highlighting the multidisciplinary aspect of tackling difficulties related to missing data in practical applications.
Related Work
Literature Research and Review
The approach used to select and analyze the literature included in this review involved identifying articles relevant to the topic of missing data and the imputation thereof. Articles were selected based on their topic relevance as it related to statistical and machine learning concepts. Preference was given to articles that were published recently to ensure the inclusion of the latest methods and research in the field. In addition, articles were prioritized from peer-reviewed and reputable conference papers to ensure quality and reliability of the research. A comprehensive search using Google Scholar and the available databases accessed through the University of West Florida was implemented. Key search terms included: ‘missing data’, ‘missing data imputation’.
Articles were selected based on their relevance to the subject, in addition to including papers with a variety of methods implemented. Research addressing both statistical and machine learning approaches were focused on. Articles not available in full text, not directly related to missing data imputation methods, or where the primary focus was theoretical as opposed to practical were excluded.
Once an article was selected, it was reviewed for relevant data pertaining to missing data imputation methods. The context of the study, the datasets involved, and the results provided were also reviewed. The article analysis focused on:
Goal – understanding the specific goal of the study and how it relates to missing data imputation.
Methods/Techniques – Identifying the specific methods used, whether statistical or machine learning, or a combination of both.
Results/Applications – evaluating the effectiveness/efficiency of the methods employed. Showing how these methods were/can be applied in a real-world scenario.
Limitations/Challenges – noting the limitations or challenges involved in each method used.
Data – identifying the types of data used for testing (e.g. simulated, real-world) and the type of missing data (e.g. missing at random, missing completely at random, not missing at random)
Project Purpose
The purpose of this project is to demonstrate the use of several different types of missing data imputation methods using a test dataset with missing data. The performance of various imputation methods will be assessed using RMSE and R-Squared of the resulting predictive models after applying various imputation methods.
Methods
An overview of the methodology used to demonstrate different techniques for imputing missing data is shown below:
Figure 2: Overview of Methodology for Demonstrating Different Techniques for Missing Data Imputation. Figured generated by Robert Stairs, 14JUL2024.
Data were classified as either MCAR or not MCAR using Hawkin’s Test, Non-Parametric Test, and Little’s MCAR Test (Li 2013). The Hawkin’s Test checks for both multivariate normality and homoscedasticity (equal variances), and assumes normality. The Non-Parametric test checks for homoscedasticity without assuming normality. Lastly, the Little’s MCAR test is a chi-squared test that evaluates whether mean differences exists between groups with different missing data patterns. In these tests, the hypotheses are as follows:
Null Hypothesis (H0): The data is missing completely at random (MCAR).
Alternative Hypothesis (H1): The data is not missing completely at random.
Note, this procedure is only designed to test between MCAR/not MCAR and is not able to classify data as MAR or MNAR. Data visualization is recommended to provide supportive information to support the MAR assumptions.
Missing Value Deletion Methods
The following methods were used to delete missing data for this project:
List-wise Deletion: all rows containing missing values are removed from the dataset
Column Delete with Threshold (Feature Selection): if the total percentage of missing values in a column is over a threshold (20%), then the entire column/feature is removed.
Imputation Methods
The following methods were used to impute missing data for this project:
Mean imputation: missing values are replaced with the mean of the non-missing values of the corresponding feature.
Mode imputation: missing values are replaced with the mode of the non-missing values of the corresponding feature.
Median imputation: missing values are replaced with the median of the non-missing values of the corresponding feature.
MICE (multiple imputation by chained equations): missing data for a given column are imputed based on other columns with observed data over multiple iterations (Azur et al. 2011). It operates under the assumption that the data are missing at random (MAR). For a missing column, the missing values are imputed using regression based on observed data from other columns in the dataset. Then, the next column’s missing data are imputed using observed data in other columns and so forth until regression models are available for each column. Each regression model will have its own level of uncertainty. With each iteration, a possible missing value is imputed for a given data point given the regression function and its associated uncertainty. This process is repeated over multiple iterations, where the number of iterations is set by the user. This results in multiple datasets, which are then combined into a single dataset again. In this case, 50 iterations were used.
Figure 3: Multiple Imputation versus Single Imputation Methods, (Ren et al. 2023)
missForest: uses a random forest trained on observed data to predict missing values in continuous and categorical data. It is an ensemble machine learning method, where the data set is bootstrap sampled randomly and individual decision trees are constructed for each bootstrap (Tang 2017). The collection of decision trees is then used to impute a missing data point. This technique in machine learning helps prevent overfitting that decision trees are prone to.
Machine Learning Methods for Model-Fitting
Once missing values were imputed using each of the five methods outlined above, each resulting dataset (5) was used to fit models for Ozone levels. The datasets were split into training and test sets using a split ratio of 0.75. The following models were then used for training each dataset:
Decision Tree:
Decision trees are a tree-structured classifier used for both classification and regression. It splits the data into subsets based on the value of input features, forming a tree where each node represents a feature and each branch represents a decision rule. Advantages include: Easy to interpret and visualize, can handle mixed-type data, non-parametric making it flexible for various data distributions, requires little data pre-processing. Disadvantages include: Prone to overfitting, sensitive to small variations in the data, can create biased trees if some classes/features are dominant.
Figure 4: Visual representation of decision tree algorithm (https://365datascience.com/tutorials/machine-learning-tutorials/decision-trees/)
Random Forest:
Random forest is an ensemble learning method primarily used for classification and regression. It combines the predictions of multiple decision trees to improve accuracy and control over-fitting. It construct the decision trees during training and outputting the mode of the classes (classification) or mean prediction (regression) of the individual trees. Advantages include: high accuracy and robustness against over-fitting, works well with large datasets and high-dimensional spaces, handles both classification and regression tasks. Disadvantages include: can be computationally intensive and slower to predict, less interpretable compared to a single decision tree.
KNN is a simple, instance-based learning algorithm used for classification and regression. It classifies a data point based based on the stored instances from the training data without learning explicit parameters, but instead on how its neighbors are classified. Advantages include: simple and easy to implement, algorithm doesn’t have a separate training phase making it more flexible. Disadvantages include: Computationally expensive for large datasets due to the need to compute distances to all points, sensitive to irrelevant or redundant features, performance depends on the choice of the number of k and the distance metric used.
Figure 6: Visual representation of KNN algorithm, where k is the number of nearest neighbors (https://towardsdatascience.com/why-does-increasing-k-decrease-variance-in-knn-9ed6de2f5061)
The effectiveness of each imputation technique was evaluated using RMSE and R-Squared for the resulting Ozone level models. The RMSE and R-Squared for each imputation/model combination were reported and summarized in a data frame. The combination with the lowest RMSE was determined to be the most effective imputation technique and model fit for this dataset.
Analysis and Results
Description of the Dataset
The Ozone dataset (“Ozone: Los Angeles Ozone Pollution Data, 1976”) from the mlbench package in R was chosen for this project. It contains observations related to pollution levels in the Los Angeles area during 1976. The variables contained in the dataset are thought to influence ozone concentration. These include daily ozone measurements consisting of:
V1: Month, 1-12, where 1 is January and 12 is December
V2: Day of month
V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
V4: Daily maximum one-hour-average ozone reading
V5: 500 millibar pressure height (m) measured at Vandenberg AFB
V6: Wind speed (mph) at Los Angeles International Airport (LAX)
V7: Humidity (%) at LAX
V8: Temperature (degrees F) measured at Sandburg, CA
V9: Temperature (degrees F) measured at El Monte, CA
V10: Inversion base height (feet) at LAX
V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
V12: Inversion base temperature (degrees F) at LAX
V13: Visibility (miles) measured at LAX
The dataset allows for analysis of how different meteorological conditions/factors can affect ozone levels. This dataset was chosen because it already has a high frequency of missing data, which is representative of a real-world application for missing data. It contains a total of 13 variables, 366 observations (one for each day for one year), and 139 missing or NA values for V9, which represents 38% of the total observations in the dataset. In practical applications, the missing values are unknown (not simulated), so it is up to the investigator to choose a methodology for handling the missing data. The investigator will also need to choose appropriate metrics for evaluating the effectiveness of different methods for handling missing data.
Description of the Imputation Workflow(RS,EM)
The data were first pre-processed. This step typically entails importing the data, checking for missing values, checking data types, and making transformations as needed. In this case, all columns were converted to numeric data type and renamed for easier interpretation. Missing values by column were reported in summary statistics, but were not addressed until further visualization was performed.
Following pre-processing, correlation coefficients were calculated for all variables with respect to Ozone levels (the response variable). Variables with strong positive or negative correlation were then visualized using histograms to show the distribution of each variable with respect to percentage of Ozone level readings. Missing data were then visualized to determine if there were any significant patterns to the missingness of the data. The purpose of this step is to provide evidence to support that the data are either MCAR, MAR, or MNAR.
Missing data were then imputed with various methodologies, including list-wise deletion, feature removal, mean imputation, mode imputation, median imputation, MICE, and missForest. This resulted in 7 unique datasets, where missing data were imputed using different metholodologies. To compare the effectiveness of each imputation method, each of the imputed datasets were split into test and training datasets. The training set was used to develop machine learning models to predict Ozone levels, including decision trees, random forest, and KNN. The resulting RMSE from each machine learning model and imputation method were used to identify the most effective imputation method and model combination for this dataset.
First, the data were imported and converted to a data frame. Next, columns were reordered such that V4 (response variable) is the first column. Lastly, the data were summarized using the summary() function. The summary function gives the number of NA’s per column. There are 139 missing values for V9 in the dataset. The number of missing values (NA’s) in each column are as follows:
V4: 5
V1: None
V2: None
V3: None
V5: 12
V6: None
V7: 15
V8: 2
V9: 139
V10: 15
V11: 1
V12: 14
V13: None
Import Data
Code
# import datadata("Ozone", package ="mlbench")# convert to data frameozone1 <-data.frame(Ozone)# Reorder columnsozone1 <- ozone1 %>% dplyr::select(V4,V1,V2,V3,V5,V6,V7,V8,V9,V10,V11,V12,V13)# View summary of datasummary(ozone1)
V4 V1 V2 V3 V5
Min. : 1.00 1 : 31 1 : 12 1:52 Min. :5320
1st Qu.: 5.00 3 : 31 2 : 12 2:52 1st Qu.:5700
Median : 9.00 5 : 31 3 : 12 3:52 Median :5770
Mean :11.53 7 : 31 4 : 12 4:53 Mean :5753
3rd Qu.:16.00 8 : 31 5 : 12 5:53 3rd Qu.:5830
Max. :38.00 10 : 31 6 : 12 6:52 Max. :5950
NA's :5 (Other):180 (Other):294 7:52 NA's :12
V6 V7 V8 V9
Min. : 0.000 Min. :19.00 Min. :25.00 Min. :27.68
1st Qu.: 3.000 1st Qu.:49.00 1st Qu.:51.00 1st Qu.:49.73
Median : 5.000 Median :65.00 Median :62.00 Median :57.02
Mean : 4.869 Mean :58.48 Mean :61.91 Mean :56.85
3rd Qu.: 6.000 3rd Qu.:73.00 3rd Qu.:72.00 3rd Qu.:66.11
Max. :11.000 Max. :93.00 Max. :93.00 Max. :82.58
NA's :15 NA's :2 NA's :139
V10 V11 V12 V13
Min. : 111 Min. :-69.0 Min. :27.50 Min. : 0.0
1st Qu.: 890 1st Qu.:-10.0 1st Qu.:51.26 1st Qu.: 70.0
Median :2125 Median : 24.0 Median :62.24 Median :110.0
Mean :2591 Mean : 17.8 Mean :60.93 Mean :123.3
3rd Qu.:5000 3rd Qu.: 45.0 3rd Qu.:70.52 3rd Qu.:150.0
Max. :5000 Max. :107.0 Max. :91.76 Max. :500.0
NA's :15 NA's :1 NA's :14
Data Pre-Processing
The data type was evaluated for each column. V1, V2, and V3 (Month, day of month, and day of week) were found to be “factor” data type, while the remaining columns were found to be be “numeric”. V1, V2, and V3 columns were converted to numeric data type.
Check data types
Code
# Create function to check data types of data frame columnscheck_data_types <-function(ozone1) {sapply(ozone1, class)}# Check data types of the columnsdata_types <-check_data_types(ozone1)print(data_types)
# Function to convert specified columns from factor to numericconvert_factors_to_numeric <-function(data, columns) { data[columns] <-lapply(data[columns], function(x) as.numeric(as.character(x)))return(data)}# Convert first 3 columns from factor to numericcolumns_to_convert <-c("V1", "V2", "V3")ozone_date <-convert_factors_to_numeric(ozone1, columns_to_convert)ozone2 <-convert_factors_to_numeric(ozone1, columns_to_convert)# Check data types of the columns againdata_types <-sapply(ozone_date, class)print(data_types)
Created a new column “Date” that combines date columns (month, day of month, day of week) into a single column.
Code
# combine Day and Month to create a Date columnozone_date <- ozone_date %>%mutate(Date =as.Date(paste(1976, V1, V2, sep ="-"), format ="%Y-%m-%d"))# Verify the new Date columnhead(ozone_date, n=10)
Features Legend: * Ozone_reading: Daily maximum one-hour-average ozone reading * Month: 1 = January, …, 12 = December * Day of month: 1-30/31 * Day of week: 1 = Monday, …, 7 = Sunday * Pressure_afb: 500 millibar pressure height (m) measured at Vandenberg AFB * Wind_speed_LAX: Wind speed (mph) at Los Angeles International Airport (LAX) * Humidity_LAX: Humidity (%) at LAX * Temp_sandburg: Temperature (degrees F) measured at Sandburg, CA * Temp_EM: Temperature (degrees F) measured at El Monte, CA * IBH_LAX: Inversion base height (feet) at LAX * Pressure_gradient: Pressure gradient (mm Hg) from LAX to Daggett, CA * IBT_LAX: Inversion base temperature (degrees F) at LAX * Visibility_LAX: Visibility (miles) measured at LAX
Data Exploration
Summary Statistics by Day of Week No notable trend in ozone levels by day of the week, except that mean and median ozone levels appear to be lower on day 4 of the week (Thursdays).
Code
# Summary statistics by day of the week ozone_summary_by_day <- ozone2 %>%group_by(Day_of_week) %>%summarize(mean_ozone =mean(Ozone_reading, na.rm =TRUE),median_ozone =median(Ozone_reading, na.rm =TRUE),max_ozone =max(Ozone_reading, na.rm =TRUE),min_ozone =min(Ozone_reading, na.rm =TRUE) )print(ozone_summary_by_day)
Summary Statistics by Month Mean and median ozone levels are lowest in the months of January, February, and March on average. They increase during the summer months, peaking in July, and gradually decline through the remainder of the year.
Examine how feature variables correlate with Ozone levels
Correlation coefficients were determined for each variable with respect to Ozone levels, the response variable. A bar graph was used to summarize the correlation coefficients. The variables with strong correlations were:
Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
Strong Negative Correlations: IBH_LAX and Visibility_LAX
Code
# Make sure data is in numeric formozone2[] <-lapply(ozone2, as.numeric)# Calculate correlations with Ozonecorr_coeffs <-cor(ozone2, use ="complete.obs")['Ozone_reading', ]corr_coeffs <- corr_coeffs[!names(corr_coeffs) %in%'Ozone_reading']# Create a data frame for plottingcorr_df <-data.frame(Variable =names(corr_coeffs), Correlation = corr_coeffs)# Create the bar graphggplot(corr_df, aes(x =reorder(Variable, Correlation), y = Correlation)) +geom_bar(stat ='identity',fill ="blue") +xlab('Variable') +ylab('Correlation Coefficient') +ggtitle('Correlation Between Variables and Daily Average Ozone Reading') +theme(axis.text.x =element_text(angle =90, hjust =1))
Explore Interactions with Positively Correlated Variables
Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
The percentage of ozone readings appears to be approximately normally distributed with respect to Pressure_afb, IBT_LAX, and Temp_sandburg. There appears to be some negative skew with respect to humidity, with a secondary mode at the lowest humidity bin (10-20). There are a lot of missing data points for Temp_EM (139 observations out of 366), as discussed in the initial data summary. This represents 38.0% of the observations for Temp_EM.
Code
## Humidity# Create bins for humidity levelsozone3 <- ozone2 %>%mutate(humidity_bin =cut(Humidity_LAX, breaks =seq(10, 100, by =10), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data1 <- ozone3 %>%group_by(humidity_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data1, aes(x = humidity_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Humidity Levels",x ="Humidity Levels",y ="Percentage of Ozone Readings") +theme_minimal()
Code
## Pressure# Create bins for humidity levelsozone3 <- ozone2 %>%mutate(pressure_bin =cut(Pressure_afb, breaks =seq(5300, 6000, by =100), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data2 <- ozone3 %>%group_by(pressure_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data2, aes(x = pressure_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Pressure Levels",x ="Pressure Levels",y ="Percentage of Ozone Readings") +theme(axis.text.x =element_text(angle =45, vjust=0.5, size=8))
Code
## IBT - Inversion base temperature (degrees F) at LAX# Create bins for Inversion base temp levelsozone3 <- ozone2 %>%mutate(IBT_bin =cut(IBT_LAX, breaks =seq(20, 100, by =10), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data3 <- ozone3 %>%group_by(IBT_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data3, aes(x = IBT_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Inversion base temp Levels",x ="Inversion base temp Levels",y ="Percentage of Ozone Readings") +theme_minimal()
Code
## Temp_EM - Temperature (degrees F) measured at El Monte, CA# Create bins for temp levelsozone3 <- ozone2 %>%mutate(Temp_EM_bin =cut(Temp_EM, breaks =seq(20, 100, by =10), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data4 <- ozone3 %>%group_by(Temp_EM_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data4, aes(x = Temp_EM_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Temp at El Monte Levels",x ="Temp Levels",y ="Percentage of Ozone Readings") +theme_minimal()
Code
## Temp_sandburg# Create bins for temp levelsozone3 <- ozone2 %>%mutate(Temp_sd_bin =cut(Temp_sandburg, breaks =seq(20, 100, by =10), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data5 <- ozone3 %>%group_by(Temp_sd_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data5, aes(x = Temp_sd_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Temp at Sandburg Levels",x ="Temp Levels",y ="Percentage of Ozone Readings") +theme_minimal()
Explore Interactions with Negatively Correlated Variables
Strong Negative Correlations: IBH_LAX and Visibility_LAX The percentage of ozone readings does not appear to be normally distributed with respect to IBH_LAX and Visbility_LAX. The highest proportion of Ozone readings are at high IBH (4600-5100), with a secondary mode at low IBH (600-1100). The proportion of Ozone readings are positively skewed with respect to Visbility_LAX, with the mode occurring between 50 and 100, and a secondary mode occuring between 250 and 300.
Code
## IBH_LAX - Inversion base height (feet) at LAX# Create bins for IBH levelsozone3 <- ozone2 %>%mutate(IBH_bin =cut(IBH_LAX, breaks =seq(100, 5500, by =500), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data6 <- ozone3 %>%group_by(IBH_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data6, aes(x = IBH_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by IBH Levels",x ="IBH Levels",y ="Percentage of Ozone Readings") +theme(axis.text.x =element_text(angle =45, vjust=0.5))
Code
## Visibility# Create bins for Visibility levelsozone3 <- ozone2 %>%mutate(visibility_bin =cut(Visibility_LAX, breaks =seq(0, 500, by =50), include.lowest =TRUE))# Calculate percentage of ozone readings for each binpercentage_data7 <- ozone3 %>%group_by(visibility_bin) %>%summarize(Ozone_reading =n()) %>%mutate(percentage = (Ozone_reading /sum(Ozone_reading)) *100)# Create the bar graphggplot(percentage_data7, aes(x = visibility_bin, y = percentage)) +geom_bar(stat ="identity", fill ="skyblue", color ="black") +labs(title ="Percentage of Ozone Readings by Visibility Levels",x ="Visibility Levels",y ="Percentage of Ozone Readings") +theme_minimal()
Exploration of the Missing Data
The total number of missing data points for each column is shown below as a count and as a percentage. Most of the columns contain <5% missing values. Temp_EM contains 139 missing values, which is 38.0% of the observations. The total number of missing values in the dataset is 203 out of 4,768 (13 columns times 366 obervations). This represents 4.3% of the entire dataset.
Summarize missing values
Code
# Function to summarize missing values in a data framesummarize_missing_values <-function(data) { data %>%summarize_all(~sum(is.na(.))) %>%gather(key ="column", value ="missing_values") %>%mutate(missing_percentage = (missing_values /nrow(data)) *100)}missing_summary <-summarize_missing_values(ozone2)# Print the summary of missing valuesprint(missing_summary)
Three different plots were generating to visualize the missing data.
The first uses the vis_miss() function and the resulting plot shows the missing values shaded in black for each column and each row. Since the rows are arranged by date, we can see that the most of the data are missing in random clusters with respect to date. Temp_EM, which has the highest proportion of missing data (38%), appears to be missing in pretty regular intervals. This was further explored in the next section.
Code
# Plot missing data pattern#Shows what percentage of the data are missing from each columnvis_miss(ozone2)
The second plot uses the gg_miss_upset() function. This plot shows the number of missing values not only by each column, but by each possbile combination of missing values (intersections). For example, the third bar in this figure shows that there are 9 rows that are missing values for IBT_LAX, Humidity_LAX, and IBH_LAX. The majority of the rows with missing data appear to be only missing Temp_EM (127). The next highest intersection is Pressure_afb_NA, which contains 10 rows.
Code
#This plot gives a visual of what combinations of NAs are present and how many there are for each#set nsets to 8 since we have 8 columns with missing datagg_miss_upset(ozone2, nsets=8)
The third plot was generated using the gg_miss_var() function and shows the number of missing values for each variable. Again, Temp_EM has by far the largest number of missing points (139).
Code
#Another way to visualize number of missing rows per columngg_miss_var(ozone2) +labs(y ="Number of missing values") +ylim(0, 150)
Randomness Testing of Missing Data The Little’s MCAR test was performed to analyze randomness. * Null Hypothesis (H0): The data is missing completely at random (MCAR). * Alternative Hypothesis (H1): The data is not missing completely at random. Based on the p-value(4.102052e-12) from the test, the null hypothesis is rejected, data is not missing completely at random (MCAR).
##Additionally, the Hawkin’s and the Non-Parametric tests were performed to analyze missing data randomness.
Test Data for Normality In order to interpret the tests, the data must be checked for normality. The Shapiro-Wilk and the Anderson-Darling tests were performed.
Code
# Extract the dependent variableozone_levels <- ozone2$Ozone_reading# Remove NA valuesozone_levels <-na.omit(ozone_levels)# Perform normality testslibrary(nortest)shapiro_test <-shapiro.test(ozone_levels)ad_test <-ad.test(ozone_levels)# Print the resultsprint(shapiro_test)
Shapiro-Wilk normality test
data: ozone_levels
W = 0.91044, p-value = 8.37e-14
Code
print(ad_test)
Anderson-Darling normality test
data: ozone_levels
A = 10.027, p-value < 2.2e-16
A histogram and QQ plot were also created
Code
# Histogramggplot(ozone2, aes(x = Ozone_reading)) +geom_histogram(binwidth =5) +ggtitle("Histogram of Ozone")
Call:
MissMech::TestMCARNormality(data = .)
Number of Patterns: 4
Total number of cases used in the analysis: 354
Pattern(s) used:
Temp_EM Month Day_of_month Day_of_week Pressure_afb
group.1 NA 1 1 1 1
group.2 1 1 1 1 1
group.3 1 1 1 1 1
group.4 1 1 1 1 NA
Wind_speed_LAX Humidity_LAX Temp_sandburg IBH_LAX
group.1 1 1 1 1
group.2 1 1 1 1
group.3 1 NA 1 NA
group.4 1 1 1 1
Pressure_gradient IBT_LAX Visibility_LAX Number of cases
group.1 1 1 1 129
group.2 1 1 1 206
group.3 1 NA 1 9
group.4 1 1 1 10
Test of normality and Homoscedasticity:
-------------------------------------------
Hawkins Test:
P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
Either the test of multivariate normality or homoscedasticity (or both) is rejected.
Provided that normality can be assumed, the hypothesis of MCAR is
rejected at 0.05 significance level.
Non-Parametric Test:
P-value for the non-parametric test of homoscedasticity: 0.2743229
Reject Normality at 0.05 significance level.
There is not sufficient evidence to reject MCAR at 0.05 significance level.
Hawkins Test
P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
This test checks for both multivariate normality and homoscedasticity (equal variances).
Interpretation: Since the p-value is less than 0.05, the null hypothesis of multivariate normality or homoscedasticity (or both) is rejected. This means that the data does not follow a multivariate normal distribution, or the variances are not equal across groups.
MCAR Hypothesis: Provided that the above tests show that the data is not normally distributed, the hypothesis of MCAR (Missing Completely At Random) is not rejected at the 0.05 significance level.
Non-Parametric Test
P-value for the non-parametric test of homoscedasticity: 0.2743229
This test checks for homoscedasticity without assuming normality.
Interpretation: Since the p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis of homoscedasticity. This suggests that the variances are equal across groups.
MCAR Hypothesis: There is not sufficient evidence to reject the hypothesis of MCAR at the 0.05 significance level.
Further Visualization of the Missing Data: Missingness Patterns
The data were further visualized using the gg_miss_fct() function to plot the proportion of missing values for each column (y-axis) broken down by one variable as a time (x-axis). The purpose of this is to determine if there are discernible patterns in the missingness of the data. This provides additional supportive information for classifying the data as MCAR, MAR, or MNAR.
Focusing on Temp_EM since it is the column with the highest number of missing values, the data appear to be missing randomly with respect to most of the variables. With respect to the date columns, the pattern of missingness appears to be random with respect to day of the month. However, it is apparent that there is a high proportion of missing data on days 6 and 7 of the week (Saturdays and Sundays). This indicates that Temp_EM was likely not measured on the weekends throughout the year. The values for Temp_EM also appear to be missing randomly with respect to month, though there is a notable high frequency of missing data around the month of May.
For the other variables in the dataset, there does not appear to be any obvious patterns to the missing data. Most importantly, the frequency of missing data for Temp_EM does not appear to depend on the output variable (ozone reading). Therefore, we can most likely proceed to imputation with the assumption that our data are missing completely at random (MCAR).
Code
# Create gg_miss_fct plots with adjusted themesp1 <-gg_miss_fct(ozone2, fct = Month) +ggtitle("Missing Data by Month") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p2 <-gg_miss_fct(ozone2, fct = Day_of_month) +ggtitle("Missing Data by Day of Month") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p3 <-gg_miss_fct(ozone2, fct = Day_of_week) +ggtitle("Missing Data by Day of Week") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p4 <-gg_miss_fct(ozone2, fct = Ozone_reading) +ggtitle("Missing Data by Ozone Reading") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p5 <-gg_miss_fct(ozone2, fct = Pressure_afb) +ggtitle("Missing Data by Solar Radiation") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p6 <-gg_miss_fct(ozone2, fct = Wind_speed_LAX) +ggtitle("Missing Data by Wind Speed") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p7 <-gg_miss_fct(ozone2, fct = Humidity_LAX) +ggtitle("Missing Data by Humidity (LAX)") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p8 <-gg_miss_fct(ozone2, fct = Temp_sandburg) +ggtitle("Missing Data by Temperature (Sandburg)") +theme(axis.text.x =element_text(angle =45, hjust =1)) +theme(plot.title =element_text(size=8))p9 <-gg_miss_fct(ozone2, fct = Temp_EM) +ggtitle("Missing Data by Temperature (EM)") +theme(plot.title =element_text(size=8))p10 <-gg_miss_fct(ozone2, fct = IBH_LAX) +ggtitle("Missing Data by IBH_LAX") +theme(plot.title =element_text(size=8))p11 <-gg_miss_fct(ozone2, fct = Pressure_gradient) +ggtitle("Missing Data by Pressure Gradient") +theme(plot.title =element_text(size=8))p12 <-gg_miss_fct(ozone2, fct = IBT_LAX) +ggtitle("Missing Data by IBT_LAX") +theme(plot.title =element_text(size=8))p13 <-gg_miss_fct(ozone2, fct = Visibility_LAX) +ggtitle("Missing Data by Visibility_LAX") +theme(plot.title =element_text(size=8))# Arrange the plots into grids with proper spacinggrid1 <-grid.arrange(p1, p2, p3, p4, nrow =2)
Code
grid2 <-grid.arrange(p5, p6, p7, p8, nrow =2)
Code
grid3 <-grid.arrange(p9, p10, p11, nrow =2)
Code
grid4 <-grid.arrange(p12, p13, nrow =1)
Missing Data Imputation Using Simple Methods
Missing data were deleted using listwise deletion or feature selection as a baseline comparison for imputation methods. Additionally, missing data imputed using mean, median or mode as a demonstration of simple imputation techniques. Listwise deletion is simply deleting all rows that contain a missing value. Feature removal is where a column is deleted if its total percentage of missing values is over a set threshold (we chose 20%). With mean, median, or mode imputation, all NA values are replaced by a single value for each column: the mean, median, or mode of that column. One dataframe was created for each method, resulting in a total of five dataframes:
Feature selection (column deletion) In this case, the Temp_EM column is dropped because more than 20% of the values are missing. Remaining rows with NAs are also dropped so that machine learning models can be applied.
Code
# Function to drop column if quantity of missing values is over the thresholddrop_na_columns <-function(data, threshold) { na_counts <-colSums(is.na(data)) na_proportion <- na_counts /nrow(data) data <- data[, na_proportion <= threshold]return(data)}# Define threshold (e.g., 20% NA allowed)threshold <-0.20# Drop columns based on the NA thresholddropCol_data <-drop_na_columns(ozone2, threshold) # the column Temp_EM gets droppeddropCol_data <-data.frame(dropCol_data)dropCol_data <-na.omit(dropCol_data)head(dropCol_data)
Create a new dataset where all missing values are replaced with the mean of the column.
Code
# Function for mean imputationmean_impute <-function(x) { x[is.na(x)] <-mean(x, na.rm =TRUE)return(x)}imputed_data_mean <-apply(ozone2, 2, mean_impute)imputed_data_mean <-data.frame(imputed_data_mean)head(imputed_data_mean)
Create a new dataset where all missing values are replaced with the median of the column.
Code
# Function for median imputationmedian_impute <-function(x) { x[is.na(x)] <-median(x, na.rm =TRUE)return(x)}imputed_data_median <-apply(ozone2, 2, median_impute)imputed_data_median <-data.frame(imputed_data_median)head(imputed_data_median)
Create a new dataset where all missing values are replaced with the mode of the column.
Code
# Function for mode imputation (using the most common value)mode_impute <-function(x) { mode_val <-as.numeric(names(sort(table(x), decreasing =TRUE)[1])) x[is.na(x)] <- mode_valreturn(x)}imputed_data_mode <-apply(ozone2, 2, mode_impute)imputed_data_mode <-data.frame(imputed_data_mode)head(imputed_data_mode)
Missing Data Imputation Using MICE (Multiple Imputation Method) and missForest (Machine Learning Method)
Next, the missing data were imputed using MICE and Miss_Forest methods.
missForest
Code
library(missForest)# verbose = If 'TRUE' the user is supplied with additional output between iterations# xtrue = Complete data matrixozone2_mf <-missForest(ozone2, xtrue = ozone2, verbose =TRUE)# convert back to data frameozone2_mf <-as.data.frame(ozone2_mf$ximp)print(sum(is.na(ozone2_mf)))## The final results can be accessed directly. The estimated error:ozone2_mf$OOBerror## The true imputation error (if available):ozone2_mf$error
MICE
Code
library(mice)# Impute missing values using MICE# pmm = Predictive Mean Matching (suitable for continuous variables like temperature, wind, etc.)# m = 5: number of imputed datasets to create.# maxit = 50: max number of iterationsozone2_mice <-mice(ozone2, method ="pmm", m =5, maxit =50)# extracts the completed datasets from the mice objectozone2_mice <-complete(ozone2_mice)# Convert completed data to data frameozone2_mice <-as.data.frame(ozone2_mice)print(sum(is.na(ozone2_mice)))
Model-Fitting of Imputed Datasets
Make Predictions Using Data Where Missing Values Were Deleted Using Listwise Deletion (Deletion of all missing rows with missing data)
The listwise deleted dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
Load additional libraries and split the data into training and testing sets
Code
#Load additional libraries for model fittinglibrary(caret) # for fitting KNN modelslibrary(e1071)library(rsample) # for creating validation splitslibrary(recipes) # for feature engineeringlibrary(randomForest)library(rpart)# decision treelibrary(tidymodels) library(class) library(vip) # Split the data into training and testing setsset.seed(123)data_split_dropNA <-initial_split(dropNA_data, prop =0.75)train_dropNA <-training(data_split_dropNA)test_dropNA <-testing(data_split_dropNA)# Split data into predictors and targetX <- train_dropNA[, -1] # Featuresy <- train_dropNA$Ozone_reading # Target# Function to calculate RMSErmse <-function(pred, actual) {sqrt(mean((pred - actual)^2))}
Random Forest
Code
# Random forest modelrf_dropNA <-randomForest(Ozone_reading ~ ., data = train_dropNA)# make predictionsrf_dropNA_pred <-predict(rf_dropNA, newdata = test_dropNA)# Plot variable importancevarImpPlot(rf_dropNA, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_dropNA <-rpart(Ozone_reading ~ ., data = train_dropNA)# make predictionstree_dropNA_pred <-predict(tree_dropNA, newdata = test_dropNA)# Plot variable importancevar_importance <-varImp(tree_dropNA)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_dropNA_rmse <-rmse(tree_dropNA_pred, test_dropNA$Ozone_reading)cat("Decision Tree RMSE:", tree_dropNA_rmse, "\n")
Decision Tree RMSE: 5.799608
Make Predictions Using Data Where Columns with >20% Missing Data Were Deleted (Temp_EM column)
The deleted column dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the column deletion method (feature selection), random forest had the best model fit, with an RMSE of 3.66.
Code
# Split the data into training and testing setsset.seed(123)data_split_dropCol <-initial_split(dropCol_data, prop =0.75)train_dropCol <-training(data_split_dropCol)test_dropCol <-testing(data_split_dropCol)# Split data into predictors and targetX <- train_dropCol[, -1] # Featuresy <- train_dropCol$Ozone_reading # Target
Random Forest
Code
# Random forest modelrf_dropCol <-randomForest(Ozone_reading ~ ., data = train_dropCol)# make predictionsrf_dropCol_pred <-predict(rf_dropCol, newdata = test_dropCol)# Plot variable importancevarImpPlot(rf_dropCol, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_dropCol <-rpart(Ozone_reading ~ ., data = train_dropCol)# make predictionstree_dropCol_pred <-predict(tree_dropCol, newdata = test_dropCol)# Plot variable importancevar_importance <-varImp(tree_dropCol)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_dropCol_rmse <-rmse(tree_dropCol_pred, test_dropCol$Ozone_reading)cat("Decision Tree RMSE:", tree_dropCol_rmse, "\n")
Decision Tree RMSE: 4.266184
Make Predictions Using Data Where Missing Values were Imputed using the Mean
The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the mean imputation method, random forest had the best model fit, with an RMSE of 4.90.
Code
# Split the data into training and testing setsset.seed(123)data_split_mean <-initial_split(imputed_data_mean, prop =0.75)train_mean <-training(data_split_mean)test_mean <-testing(data_split_mean)# Split data into predictors and targetX <- train_mean[, -1] # Featuresy <- train_mean$Ozone_reading # Target# Random forest modelrf_mean <-randomForest(Ozone_reading ~ ., data = train_mean)# make predictionsrf_mean_pred <-predict(rf_mean, newdata = test_mean)# Plot variable importancevarImpPlot(rf_mean, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_mean <-rpart(Ozone_reading ~ ., data = train_mean)# make predictionstree_mean_pred <-predict(tree_mean, newdata = test_mean)# Plot variable importancevar_importance <-varImp(tree_mean)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_mean_rmse <-rmse(tree_mean_pred, test_mean$Ozone_reading)cat("Decision Tree RMSE:", tree_mean_rmse, "\n")
Decision Tree RMSE: 5.213247
Make Predictions Using Data Where Missing Values were Imputed using the Median
The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the mean imputation method, random forest had the best model fit, with an RMSE of 5.07.
Code
# Split the data into training and testing setsset.seed(123)data_split_med <-initial_split(imputed_data_median, prop =0.75)train_med <-training(data_split_med)test_med <-testing(data_split_med)
Random Forest
Code
# Random forest modelrf_med <-randomForest(Ozone_reading ~ ., data = train_med)rf_med_pred <-predict(rf_med, newdata = test_med)# Plot variable importancevarImpPlot(rf_med, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_med <-rpart(Ozone_reading ~ ., data = train_med)tree_med_pred <-predict(tree_med, newdata = test_med)# Plot variable importancevar_importance <-varImp(tree_med)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_med_rmse <-rmse(tree_med_pred, test_med$Ozone_reading)cat("Decision Tree RMSE:", tree_med_rmse, "\n")
Decision Tree RMSE: 5.249752
Make Predictions Using Data Where Missing Values were Imputed using the Mode
The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the mode imputation method, random forest had the best model fit, with an RMSE of 5.40.
Code
# Split the data into training and testing setsset.seed(123)data_split_mode <-initial_split(imputed_data_mode, prop =0.75)train_mode <-training(data_split_mode)test_mode <-testing(data_split_mode)
Random Forest
Code
# Random forest modelrf_mode <-randomForest(Ozone_reading ~ ., data = train_mode)rf_mode_pred <-predict(rf_mode, newdata = test_mode)# Plot variable importancevarImpPlot(rf_mode, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_mode <-rpart(Ozone_reading ~ ., data = train_mode)tree_mode_pred <-predict(tree_mode, newdata = test_mode)# Plot variable importancevar_importance <-varImp(tree_mode)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_mode_rmse <-rmse(tree_mode_pred, test_mode$Ozone_reading)cat("Decision Tree RMSE:", tree_mode_rmse, "\n")
Decision Tree RMSE: 6.603588
Make Predictions using data where missing values were imputed with complex methods:
Make Predictions Using Data Where Missing Values were Imputed using missForest Method
The missForest imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the missForest imputation method, random forest had the best model fit, with an RMSE of 4.46.
Code
# Split the data into training and testing setsset.seed(123)data_split_miss <-initial_split(ozone2_mf, prop =0.75)train_miss <-training(data_split_miss)test_miss <-testing(data_split_miss)
Random Forest
Code
# Random forest modelrf_miss <-randomForest(Ozone_reading ~ ., data = train_miss)rf_miss_pred <-predict(rf_miss, newdata = test_miss)# Plot variable importancevarImpPlot(rf_miss, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_miss <-rpart(Ozone_reading ~ ., data = train_miss)tree_miss_pred <-predict(tree_miss, newdata = test_miss)# Plot variable importancevar_importance <-varImp(tree_miss)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_miss_rmse <-rmse(tree_miss_pred, test_miss$Ozone_reading)cat("Decision Tree RMSE:", tree_miss_rmse, "\n")
Decision Tree RMSE: 5.537124
Make Predictions Using Data Where Missing Values were Imputed using MICE Method
The MICE imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the MICE imputation method, random forest had the best model fit, with an RMSE of 4.70.
Code
# Split the data into training and testing setsset.seed(123)data_split_mice <-initial_split(ozone2_mice, prop =0.75)train_mice <-training(data_split_mice)test_mice <-testing(data_split_mice)
Random Forest
Code
# Random forest modelrf_mice <-randomForest(Ozone_reading ~ ., data = train_mice)rf_mice_pred <-predict(rf_mice, newdata = test_mice)# Plot variable importancevarImpPlot(rf_mice, main ="Variable Importance Plot for Random Forest Model")
# Decision tree modeltree_mice <-rpart(Ozone_reading ~ ., data = train_mice)tree_mice_pred <-predict(tree_mice, newdata = test_mice)# Plot variable importancevar_importance <-varImp(tree_mice)barplot(var_importance$Overall, main ="Variable Importance for Decision Tree", xlab ="Variable", ylab ="Importance")
Code
tree_mice_rmse <-rmse(tree_mice_pred, test_mice$Ozone_reading)cat("Decision Tree RMSE:", tree_mice_rmse, "\n")
Decision Tree RMSE: 5.077593
Summarize Model Performance for Each Imputation Methodology
The RMSE for each model and imputation method combination was summarized into a data frame. The results are shown below. The best model fit (lowest RMSE) was obtained when using feature selection (column deletion) for imputation of missing data and the random forest algorithm for model training with the resulting dataset. Mean imputation with the Decision Tree model fitting was the second best combination, followed by mode imputation with Random Forest.
In summary, maintaining the integrity and dependability of data analysis procedures depends on the management of missing data. This work demonstrates various methodologies that can be used to handle missing data, including listwise and feature selection deletion methods, machine learning-based algorithms like missForest, single imputation techniques such as mean, mode, median imputation, and multiple imputation using MICE. The dataset “Ozone”, from the mlbench package in R was used to demonstrate the techniques for either deleting or imputing missing data. Statistical tests and visualization techniques to help classify data as MCAR, MAR, and NMAR were also demonstrated. With respect to the “Ozone” dataset, deletion of the column Temp_EM, which contains ~38% missing data, was found to be the most effective method for handling missing data. This was demonstrated through machine learning models trained to datasets where missing data were handled using the various techniques described above. A total of seven datasets were created, and each dataset was trained and tested with KNN, decision tree, and random forest algorithms. The random forest model in combination with feature selection for handling missing data resulted in the lowest RMSE with respect to the test (unseen) dataset. This may have been the best result in this instance because this data appear to be MCAR based on statistical testing. It is best practice to test a variety of different methods, as was performed in this work, and predefine success criteria when handling missing data. To provide reliable and accurate results from data analysis, the imputation method selected should ultimately be in line with the unique features of the dataset and the analytical objectives.
References
Awan, Bennamoun, S. E. 2022. “A Reinforcement Learning-Based Approach for Imputing Missing Data.”Neural Computing and Applications 34 (12): 9701–16.
Azur, Melissa J, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?”International Journal of Methods in Psychiatric Research 20 (1): 40–49.
Batista, Gustavo EAPA, and Maria Carolina Monard. 2003. “An Analysis of Four Missing Data Treatment Methods for Supervised Learning.”Applied Artificial Intelligence 17 (5-6): 519–33.
Bennett, Derrick A. 2001. “How Can i Deal with Missing Data in My Study?”Australian and New Zealand Journal of Public Health 25 (5): 464–69.
Du, Jinghan, Minghua Hu, and Weining Zhang. 2020. “Missing Data Problem in the Monitoring System: A Review.”IEEE Sensors Journal 20 (23): 13984–98.
Emmanuel, Maupong, T. 2021. “A Survey on Missing Data in Machine Learning.”J Big Data 8: 401–7.
Ghazali, Shamihah Muhammad, Norshahida Shaadan, and Zainura Idrus. 2020. “Missing Data Exploration in Air Quality Data Set Using r-Package Data Visualisation Tools.”Bulletin of Electrical Engineering and Informatics 9 (2): 755–63.
Guzel, Kaya, C. 2013. “Breast Cancer Diagnosis Based on Naïve Bayes Machine Learning Classifier with KNN Missing Data Imputation.”AWERProcedia Information Technology & Computer Science 4: 332–46.
Ibrahim, Chen, J. G. 2011. “Missing-Data Methods for Generalized Linear Models.”Journal of the American Statistical Association 100 (469).
Jerez, Molina, J. M. 2010. “Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem.”Artificial Intelligence in Medicine 50 (2): 105–15.
Li, Cheng. 2013. “Little’s Test of Missing Completely at Random.”The Stata Journal 13 (4): 795–809.
Lin, Wei-Chao, and Chih-Fong Tsai. 2020. “Missing Value Imputation: A Review and Analysis of the Literature (2006–2017).”Artificial Intelligence Review 53: 1487–1509.
Little, Todd D, Terrence D Jorgensen, Kyle M Lang, and E Whitney G Moore. 2014. “On the Joys of Missing Data.”Journal of Pediatric Psychology 39 (2): 151–62.
Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide.”
May, Michael, Christine Körner, Dirk Hecker, Martial Pasquier, Urs Hofmann, and Felix Mende. 2009. “Modelling Missing Values for Audience Measurement in Outdoor Advertising Using GPS Data.” In GI Jahrestagung, 3993–4006.
Ren, Lijuan, Tao Wang, Aicha Sekhari Seklouli, Haiqing Zhang, and Abdelaziz Bouras. 2023. “A Review on Missing Values for Main Challenges and Methods.”Information Systems, 102268.
Rubin, Donald B. 2004. Multiple Imputation for Nonresponse in Surveys. Vol. 81. John Wiley & Sons.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. CRC press.
Schefer, J. 2002. “Dealing with Missing Data.”Research Letters in the Information and Mathematical Sciences 3 (1): 153–60.
Tang, Fei. 2017. Random Forest Missing Data Approaches. University of Miami.
Zhang, Sun, X. 2024. “A Matrix Completion Method for Imputing Missing Values of Process Data.”Processes 12 (4): 659.
Zheng, Siming, Juan Zhang, and Yong Zhou. 2023. “Likelihood Identifiability and Parameter Estimation with Nonignorable Missing Data.”Canadian Journal of Statistics 51 (4): 1190–1209.